GGML WebGPU: Support for ADD, MUL, RMS_NORM, GET_ROWS operators #16018
          
> I guess one option would be to add a single test that uses an explicit inplace operation (for example …)

@ggerganov yeah that makes sense. For now I realized it's easy to add a single in-place test per operation, which should add minimal time to …
Squash commit (…-org#16018):

* Add parameter buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant" (reverts commit 77f8b96)
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests

Co-authored-by: Neha Abbas <[email protected]>
Continuing to add support for more operations. My current plan is to add support for enough operations to run some popular models (open to feedback on which models to focus on), and then go back and work on writing more efficient code for the more important operations and for dequantization.
In this PR:
- A shared template, `binary-head.tmpl`, for ADD and MUL. Note that I'm kinda rolling my own preprocessing through `embed-wgsl.py`, since WGSL doesn't have any preprocessor support.
- Separate shader templates for the in-place variants, e.g. `mul.tmpl.wgsl` and `mul_in_place.tmpl.wgsl`. While the code is basically the same, I want to note that `test-backend-ops` does not currently have support for testing the in-place versions. Happy to brainstorm what it would take to add support for testing them.
- Using the `maxComputeInvocationsPerWorkgroup` or `maxComputeWorkgroupSizeX` reported by WebGPU (1024) causes the quantized versions of the `GET_ROWS` tests to fail: Metal shader validation reported that the max threadgroup size was 704. Since the limits in WebGPU are static, they're not queryable on a per-shader basis. To allow the CI to pass, I have for now hardcoded the workgroup size to 288, and I've also opened an issue with WebGPU (Handling dynamic Metal maxThreadsPerThreadgroup, gpuweb/gpuweb#5315) to discuss possible solutions.
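The "roll your own preprocessing" approach can be sketched as follows. This is an illustrative stand-in only: the `{{NAME}}` placeholder syntax and the `expand_template` function are assumptions for the example, not the actual `embed-wgsl.py` implementation. The point is simply that, because WGSL has no preprocessor, per-operation shader variants are produced by substituting placeholders in a template at build time.

```python
import re

def expand_template(template: str, replacements: dict) -> str:
    """Substitute {{NAME}} placeholders in a WGSL template string.

    Raises KeyError on any placeholder without a replacement, so a typo in
    a template fails the build instead of emitting broken WGSL.
    """
    def sub(match):
        key = match.group(1)
        if key not in replacements:
            raise KeyError(f"unresolved placeholder: {key}")
        return replacements[key]
    return re.sub(r"\{\{(\w+)\}\}", sub, template)

# One template, two generated shaders (ADD and MUL):
template = "dst[i] = src0[i] {{OP}} src1[i];"
print(expand_template(template, {"OP": "+"}))  # dst[i] = src0[i] + src1[i];
print(expand_template(template, {"OP": "*"}))  # dst[i] = src0[i] * src1[i];
```

A build-time failure on unresolved placeholders is the main thing a hand-rolled substitution pass buys you over plain string concatenation.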